GFSF: A Novel Similarity Join Method Based on Frequency Vector

نویسندگان

  • Ziyu Lin
  • Daowen Luo
  • Yongxuan Lai
چکیده

String similarity join is widely used in many fields, e.g. data cleaning, web search, pattern recognition and DNA sequence matching. During the recent years, many similarity join methods have been proposed, for example Pass-Join, Ed-Join, Trie-Join, and so on, among which the Pass-Join algorithm based on edit distance can achieve much better overall performance than the others. But Pass-Join can not effectively filter those candidate pairs which are partially similar. Here a novel algorithm called GFSF is proposed, which introduces two additional filtering steps based on character frequency vector. Through this way, the number of pairs which are only partially similar are greatly reduced, thus greatly reducing the total time of string similarity join process. The experimental results show that the overall performance of the proposed method is better than Pass-Join.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Power-Law Based Estimation of Set Similarity Join Size

We propose a novel technique for estimating the size of set similarity join. The proposed technique relies on a succinct representation of sets using Min-Hash signatures. We exploit frequent patterns in the signatures for the Set Similarity Join (SSJoin) size estimation by counting their support. However, there are overlaps among the counts of signature patterns and we need to use the set Inclu...

متن کامل

A novel cooperative game between client and subcontractors based on technical characteristics

Large projects often have several activities which are performed by some subcontractors with several skills. Costs and time reduction and quality improvement of the project are very important for client and subcontractors. Therefore, in real large projects, subcontractors join together and form coalitions for improving the project profit. A key question is how an extra profit of cooperation amo...

متن کامل

A novel hybrid method for vocal fold pathology diagnosis based on russian language

In this paper, first, an initial feature vector for vocal fold pathology diagnosis is proposed. Then, for optimizing the initial feature vector, a genetic algorithm is proposed. Some experiments are carried out for evaluating and comparing the classification accuracies which are obtained by the use of the different classifiers (ensemble of decision tree, discriminant analysis and K-nearest neig...

متن کامل

A Novel Method for Tracking Moving Objects using Block-Based Similarity

Extracting and tracking active objects are two major issues in surveillance and monitoring applications such as nuclear reactors, mine security, and traffic controllers. In this paper, a block-based similarity algorithm is proposed in order to detect and track objects in the successive frames. We define similarity and cost functions based on the features of the blocks, leading to less computati...

متن کامل

A novel method for detecting structural damage based on data-driven and similarity-based techniques under environmental and operational changes

The applications of time series modeling and statistical similarity methods to structural health monitoring (SHM) provide promising and capable approaches to structural damage detection. The main aim of this article is to propose an efficient univariate similarity method named as Kullback similarity (KS) for identifying the location of damage and estimating the level of damage severity. An impr...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2016